Abstract. It is well known that the NIST statistical test suite was used
for the evaluation of the AES candidate algorithms. We have found that the
test settings of the Discrete Fourier Transform test and the Lempel-Ziv test
in this suite are wrong. We give four corrections of mistakes in the test
settings. This suggests that the test results should be re-evaluated.

1 Introduction

Random and pseudorandom bit generators (RBGs, PRBGs) are used for many
purposes, including cryptographic, modeling, and simulation applications. For
cryptographic purposes, they are required in the construction of encryption
keys, other cryptographic parameters, and so on. One of the criteria used to
evaluate the Advanced Encryption Standard (AES) candidate algorithms was their
demonstrated suitability as PRBGs. That is, the evaluation of their outputs
using statistical tests should not provide any means by which to
computationally distinguish them from truly random sources [1-3].

A cryptographically secure pseudorandom bit generator is defined as a PRBG
that passes the next-bit test [4]. A PRBG is said to pass the next-bit test if
there is no polynomial-time algorithm which, on input of the first $l$ bits of
an output sequence $s$, can predict the $(l+1)$st bit of $s$ with probability
significantly greater than 1/2. It is known that a PRBG passes the next-bit
test if and only if it passes all polynomial-time statistical tests. Although
a few PRBGs, such as RSA and BBS, are known to be cryptographically secure
under the assumption that the RSA problem and integer factorization are
intractable, it is difficult in general to prove that a given PRBG is
cryptographically secure. In practice, we can only subject a sample output
sequence of the PRBG to various statistical tests and evaluate whether the
sequence possesses certain attributes that a truly random sequence would be
likely to exhibit. Although various kinds of statistical tests have been
proposed so far [5-7], we focus on the NIST 800-22 statistical test suite [8]
in this paper because this test suite was used for the evaluation of the AES
candidates.

Some statistical tests are based on a statistical hypothesis $H$, namely that
a given binary sequence was produced by a random bit generator. The test only
provides a P-value, which is a measure of the strength of the evidence provided
by the data against the hypothesis. The significance level $\alpha$ of the test
of a statistical hypothesis $H$ is the probability of rejecting $H$ when it is
true. If P-value $\geq \alpha$, then the hypothesis $H$ is accepted, i.e., the
sequence is considered to be random with a confidence of $1 - \alpha$. If
P-value $< \alpha$, then the hypothesis $H$ is rejected, i.e., the sequence is
considered to be non-random with a confidence of $1 - \alpha$.
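
As an illustration of this decision rule, the following sketch computes the
P-value of the Frequency (monobit) test, the first test of the suite, and
compares it with the significance level. The function names and the two sample
inputs are illustrative choices, not part of the NIST package; the formula
P-value = erfc(|S_n| / sqrt(2n)), with S_n the sum of the bits mapped to -1 and
+1, is the standard one for this test.

```python
import math

ALPHA = 0.01  # significance level used throughout the NIST suite


def frequency_test_p_value(bits):
    """P-value of the Frequency (monobit) test for a sequence of 0/1 bits."""
    n = len(bits)
    # Map 0 -> -1 and 1 -> +1, then sum.
    s_n = sum(2 * b - 1 for b in bits)
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def is_accepted(p_value, alpha=ALPHA):
    """Accept the hypothesis H when the P-value is at least alpha."""
    return p_value >= alpha


balanced = [0, 1] * 500          # exactly half ones
biased = [1] * 600 + [0] * 400   # 60% ones
p_balanced = frequency_test_p_value(balanced)
p_biased = frequency_test_p_value(biased)
print(p_balanced, is_accepted(p_balanced))
print(p_biased, is_accepted(p_biased))
```

The balanced input is accepted and the biased one is rejected at
$\alpha = 0.01$.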

If the significance level $\alpha$ of a test of $H$ is too high, then the test
may reject sequences that were, in fact, produced by a random bit generator
(such an error is called a Type I error). On the other hand, if the
significance level $\alpha$ of a test of $H$ is too low, then there is a danger
that the test may accept sequences even though they were not produced by a
random bit generator (such an error is called a Type II error). It is therefore
important that the test be carefully designed to have a significance level that
is appropriate for the purpose at hand. However, the probability of a Type II
error is more difficult to calculate than $\alpha$ because many possible types
of non-randomness may exist. Therefore, the NIST statistical test suite, which
includes 16 tests, adopts two further analyses in order to minimize the
probability of accepting a sequence as having been produced by a good generator
when the generator was actually bad [9]. First, for each test, a set of
sequences (sample size $m$) from the generator output is subjected to the test,
and the proportion of sequences whose P-value satisfies P-value $\geq \alpha$
is calculated. If this proportion (the success rate) is close to $1 - \alpha$,
then the test is passed, i.e., the set of sequences is accepted. Second, the
distribution of P-values is calculated for each test. If these P-values are
uniformly distributed (no obvious bias), then the test is passed. These two
analyses are the crucial difference from other statistical test suites.

In Section 2, we investigate the randomness of sequences generated by various
PRBGs, including a cellular automata (CA)-based PRBG, using the statistical
test suite provided by NIST, and show that the results of the Discrete Fourier
Transform (DFT) test and the Lempel-Ziv compression test are anomalous. This
suggests that the NIST settings of these two tests are wrong. In fact, we
identify two mistakes in the NIST setting of the DFT test in Section 3 and two
mistakes in the NIST setting of the Lempel-Ziv test in Section 4. The
corrections are also given in each section. This study is important because
the NIST test suite was used for the evaluation of the AES candidates.

1.1 NIST Statistical Test Suite 

The NIST statistical test suite is a statistical package consisting of 16 tests
that were developed to test the randomness of arbitrarily long binary sequences
produced by either hardware- or software-based cryptographic random or
pseudorandom number generators. These tests focus on a variety of different
types of non-randomness that could exist in a sequence. The 16 tests are listed
in Table 1.



Table 1. List of NIST Statistical Tests

Number  Test Name
1       Frequency
2       Block Frequency
3       Runs
4       Longest Run
5       Binary Matrix Rank
6       Discrete Fourier Transform
7       Non-overlapping Template Matching
8       Overlapping Template Matching
9       Universal
10      Lempel-Ziv Compression
11      Linear Complexity
12      Serial
13      Approximate Entropy
14      Cumulative Sums
15      Random Excursions
16      Random Excursions Variant



For each statistical test, a set of P-values, corresponding to the set of
sequences, is produced. A sequence is called a success if its P-value satisfies
P-value $\geq \alpha$, and a failure otherwise. For a fixed significance level
$\alpha$, $100\alpha$% of the P-values are expected to indicate failure.¹
For the interpretation of test results, NIST adopts the following two
approaches.

(1) Examination of the proportion of success-sequences (success rate)

If the proportion of success-sequences falls outside the following acceptable
interval, there is evidence that the data is non-random:

$$ \hat{p} \pm 3 \sqrt{\frac{\hat{p}(1 - \hat{p})}{m}}, \tag{1} $$

where $\hat{p} = 1 - \alpha$ and $m$ is the number of sequences. This interval
is the 99.73% range of the normal distribution, which approximates the binomial
distribution under the assumption that each sequence is an independent sample.

(2) Uniformity of the distribution of P-values

This examination is accomplished by computing the following $\chi^2$ value:

$$ \chi^2 = \sum_{i=1}^{10} \frac{(F_i - m/10)^2}{m/10}, \tag{2} $$

where $F_i$ is the number of P-values in the sub-interval
$[(i-1) \times 0.1,\ i \times 0.1)$ and $m$ is the number of sequences (sample
size). The P-value of the P-values is then calculated as
P'-value = igamc$(9/2, \chi^2/2)$, where igamc$(a, x)$ is the incomplete gamma
function. If P'-value $\geq 0.0001$, then the set of P-values can be considered
to be uniformly distributed.

¹ All the statistical tests of the NIST statistical test suite use the same
significance level $\alpha = 0.01$.
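
The two acceptance criteria above can be sketched in a few lines of Python.
This is a minimal illustration, not the NIST implementation: the function names
are hypothetical, and igamc is evaluated with a closed form that holds for the
half-integer argument 9/2 used here, so that only the standard library is
needed.

```python
import math


def success_rate_interval(m, alpha=0.01):
    """Acceptable success-rate interval for m sequences (3-sigma range)."""
    p_hat = 1.0 - alpha
    half_width = 3.0 * math.sqrt(p_hat * (1.0 - p_hat) / m)
    return p_hat - half_width, p_hat + half_width


def igamc_9_2(x):
    """Regularized upper incomplete gamma function Q(9/2, x).

    Uses Q(1/2, x) = erfc(sqrt(x)) and the recurrence
    Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    """
    q = math.erfc(math.sqrt(x))
    for j in range(4):  # a = 1/2, 3/2, 5/2, 7/2
        a = 0.5 + j
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
    return q


def uniformity_p_value(p_values):
    """P'-value: chi-square test of the P-values over 10 equal bins."""
    m = len(p_values)
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1  # a P-value of 1.0 goes to the last bin
    expected = m / 10.0
    chi2 = sum((f - expected) ** 2 / expected for f in counts)
    return igamc_9_2(chi2 / 2.0)


low, high = success_rate_interval(1000)
print(f"acceptable success rate: [{low:.5f}, {high:.5f}]")
evenly_spread = [(k + 0.5) / 1000 for k in range(1000)]
print(uniformity_p_value(evenly_spread))
```

For $m = 1000$ and $\alpha = 0.01$ the acceptable success rate lies between
about 0.9806 and 0.9994.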



2 Results of the NIST Statistical Test Suite 

In this section, we show the results of the NIST statistical test suite for
four PRBGs (AES, SHA1, MUGI, and CA). For each statistical test, the two
further analyses described above are executed to evaluate the set of sequences.
We use 1000 samples of $10^6$-bit sequences for each test. Consequently,
10 (keys) × 1000 (samples) × $10^6$ (bits per sequence) bits are used for each
test in order to investigate the difference of the results between different
keys.² The input parameters we use are listed in Table 2.



Table 2. Parameters used for NIST Test Suite

Test Name                           Block Length
Block Frequency                     20,000
Non-overlapping Template Matching   9
Overlapping Template Matching       9
Universal (Initialization Steps)    7 (1280)
Linear Complexity                   500
Serial                              10
Approximate Entropy                 10



Table 3 shows the results of AES (128-bit key, OFB mode); entries other than
"pass" give the numbers of the tests that were not passed. All 16 tests are
passed in four cases (key 1, key 2, key 4, and key 8). The success rates of the
best case (key 1) and of the worst case (key 7) are shown in Figure 1. Dotted
lines denote the acceptable interval specified by eq. (1). As we can see, some
tests have many success rates. For example, the non-overlapping template
matching test (number 7) has 148 success rates because one success rate
corresponds to one template (a non-periodic pattern consisting of 9 bits). If
at least one success rate is outside the acceptable interval, then the test is
not passed (see the key 7 case). While all tests are passed in the key 1 case,
the non-overlapping template matching test (number 7) and the overlapping
template matching test (number 8) are not passed in the key 7 case. Note that
the uniformity of P-values fails only for the Lempel-Ziv test (number 10). The
reason why this test frequently fails will be explained later.

² The key is the initial configuration $\{S_i^{0}\}$ in the CA case.

Table 3. Results of AES.

Key   Success Rate   Uniformity
1     pass           pass
2     pass           pass
3     15             pass
4     pass           pass
5     7              10
6     14             10
7     7, 8           pass
8     pass           pass
9     pass           10
10    pass           10

Fig. 1. Success rates of AES for 16 tests (horizontal axis: test number). Key 1
(best) and key 7 (worst) cases are shown in the upper and lower panels,
respectively. Dotted lines denote the acceptable interval (eq. (1) with
$\alpha = 0.01$).

A one-dimensional 5-neighborhood CA consists of a line of cells with value
$S_i = 0$ or $1$ for $i = 0, 1, 2, \ldots, N$. These cell values are updated in
parallel in discrete time steps according to a fixed rule of the form

$$ S_i^{t+1} = F(S_{i-2}^t, S_{i-1}^t, S_i^t, S_{i+1}^t, S_{i+2}^t), \tag{3} $$

where $S_i^t$ denotes the value of cell $i$ at time $t$ [10-12]. We use the
following rule 535945230 as a CA-based PRBG [13].

[Eq. (4): Boolean expression of rule 535945230 as an XOR of products of the
neighboring cell values $S_{i-2}^t, \ldots, S_{i+2}^t$; the individual terms
are not recoverable from the source text.]

Tables 4, 5, and 6 show the results of SHA1, MUGI, and CA, respectively. In the
CA case, we use the cell values $\{S_i^t\}$ with a fixed cell number $i$ as a
sequence, with system size $N = 1000$ and the periodic boundary condition
$S_0^t = S_{N+1}^t$.
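
A 5-neighborhood CA update of the kind described above can be sketched as a
table lookup. The sketch assumes the common Wolfram-style convention in which
the five neighbor bits, read from cell i-2 (most significant) to cell i+2
(least significant), select one bit of the binary expansion of the rule number.
This bit ordering, the function names, and the use of a ring of `len(cells)`
cells are assumptions made for illustration and may differ from the convention
used in [13].

```python
import random


def ca_step(cells, rule):
    """One synchronous update of a 5-neighborhood binary CA on a ring.

    ASSUMPTION: the neighborhood (S[i-2], ..., S[i+2]) is read as a 5-bit
    number, most significant bit first, and selects one bit of `rule`.
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        index = 0
        for offset in (-2, -1, 0, 1, 2):
            index = (index << 1) | cells[(i + offset) % n]  # periodic boundary
        new_cells.append((rule >> index) & 1)
    return new_cells


def ca_bit_stream(initial_cells, rule, cell, length):
    """Collect `length` successive values of one fixed cell as a bit sequence."""
    cells = list(initial_cells)
    bits = []
    for _ in range(length):
        bits.append(cells[cell])
        cells = ca_step(cells, rule)
    return bits


RULE = 535945230          # rule number quoted in the text
rng = random.Random(0)    # the "key" is the initial configuration
initial = [rng.randint(0, 1) for _ in range(1001)]
stream = ca_bit_stream(initial, RULE, cell=0, length=32)
print(stream)
```

Reading one fixed cell over time, as in `ca_bit_stream`, mirrors the way the
sequences are extracted in the CA case.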



Table 4. Results of SHA1

Key   Success Rate   Uniformity
1     pass           pass
2     pass           10
3     7              pass
4     7              6
5     pass           10
6     7, 15, 16      pass
7     7              pass
8     7              pass
9     pass           pass
10    pass           10



Table 5. Results of MUGI

Key   Success Rate   Uniformity
1     7              pass
2     pass           10
3     10             10
4     pass           pass
5     7              pass
6     pass           pass
7     pass           pass
8     pass           pass
9     7              pass
10    pass           6



As we can see, all tests are passed in two cases for SHA1, in four cases for
MUGI, and in six cases for CA. Note that the results of CA-535945230 are better
than those of well-known good PRBGs such as AES, SHA1, and MUGI.

Table 6. Results of CA-535945230

Key   Success Rate   Uniformity
1     pass           pass
2     pass           10
3     pass           pass
4     pass           6, 10
5     pass           pass
6     pass           pass
7     pass           7
8     pass           pass
9     pass           10
10    pass           pass

If we focus on the uniformity of P-values, only the DFT test (number 6) and the
Lempel-Ziv test (number 10) frequently fail. If we choose a sample size $m$
greater than 10000, we cannot find any PRBG that passes these two tests, not
even SHA1 (SHA1 is used for the mean value and the variance value in the
distribution of the Lempel-Ziv test [8]). Figure 2 shows that the P'-values
(the uniformity of the distribution of P-values) of these two tests rapidly
decrease as the number of samples increases. In other words, these
distributions of P-values indicate an apparent deviation from randomness even
though we use a well-known good PRBG (SHA1). This observation suggests that
these two tests can be considered underdeveloped statistical tests. Since many
statistical tests are based upon asymptotic approximations, careful work needs
to be done to determine how good an approximation is. However, we have found
that these two tests have not only an approximation problem but also mistakes
in their theoretical settings.

Fig. 2. The uniformity of P-values in SHA1 case.

3 Corrections of Discrete Fourier Transform (Spectral) Test

In this section, we focus on the DFT test and show two mistakes found in the
NIST test setting. The focus of this test is the peak heights in the Discrete
Fourier Transform of the sequence. The purpose of this test is to detect
periodic features in the tested sequence that would indicate a deviation from
the assumption of randomness. The intention is to detect whether the number of
peaks exceeding the 95% threshold is significantly different from 5%. The test
description in the NIST document is as follows.

1. The zeros and ones of the input sequence ($\varepsilon$) are converted to
values of -1 and +1 to create the sequence $X = x_1, x_2, \ldots, x_n$, where
$x_i = 2\varepsilon_i - 1$.

2. Apply a Discrete Fourier Transform on $X$ to produce $S = \mathrm{DFT}(X)$.
A sequence of complex variables is produced which represents periodic
components of the sequence of bits at different frequencies.

3. Calculate $M = \mathrm{modulus}(S') \equiv |S'|$, where $S'$ is the
substring consisting of the first $n/2$ elements in $S$, and the modulus
function produces a sequence of peak heights.

4. Compute $T = \sqrt{3n}$, the 95% peak height threshold value. Under the
assumption of randomness, 95% of the values obtained from the test should not
exceed $T$.

5. Compute $N_0 = 0.95n/2$. $N_0$ is the expected theoretical (95%) number of
peaks that are less than $T$.

6. Compute $N_1$, the actual observed number of peaks in $M$ that are less than
$T$.

7. Compute $d = \dfrac{N_1 - N_0}{\sqrt{n(0.95)(0.05)/2}}$.

8. Compute P-value $= \mathrm{erfc}\left(\dfrac{|d|}{\sqrt{2}}\right)$.
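
For reference, the eight steps above translate almost line by line into code.
The sketch below follows the NIST description as quoted (threshold sqrt(3n),
variance n(0.95)(0.05)/2), not a corrected setting. The function name is
hypothetical, and a naive O(n^2) DFT built from the standard library is used in
place of an FFT, so it is only practical for short sequences.

```python
import cmath
import math
import random


def nist_dft_test(bits):
    """P-value of the DFT (spectral) test, following the eight steps as quoted."""
    n = len(bits)
    x = [2 * e - 1 for e in bits]                      # step 1
    peaks = []
    for freq in range(n // 2):                         # steps 2-3: first n/2 moduli
        s = sum(x_k * cmath.exp(1j * 2.0 * math.pi * k * freq / n)
                for k, x_k in enumerate(x))
        peaks.append(abs(s))
    t = math.sqrt(3.0 * n)                             # step 4
    n0 = 0.95 * n / 2.0                                # step 5
    n1 = sum(1 for peak in peaks if peak < t)          # step 6
    d = (n1 - n0) / math.sqrt(n * 0.95 * 0.05 / 2.0)   # step 7
    return math.erfc(abs(d) / math.sqrt(2.0))          # step 8


rng = random.Random(1)
random_bits = [rng.randint(0, 1) for _ in range(512)]
periodic_bits = [0, 1] * 256   # strongly periodic input
p_random = nist_dft_test(random_bits)
p_periodic = nist_dft_test(periodic_bits)
print(p_random)
print(p_periodic)
```

The alternating input has no peak above the threshold among the first n/2
frequencies, so the observed count differs significantly from the expected 95%
and the P-value falls below 0.01.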
3.1 The derivation of the threshold T 

First, we show the derivation of the threshold $T = \sqrt{3n}$. For a frequency
$j$, the DFT is defined by the following equation:

$$ S_j = \sum_{k=1}^{n} x_k \cos\left(\frac{2\pi (k-1)}{n} j\right)
       + i \sum_{k=1}^{n} x_k \sin\left(\frac{2\pi (k-1)}{n} j\right). \tag{5} $$

Let us consider the squared modulus of $S_j$,

$$ |S_j|^2 = c_j^2 + s_j^2, \tag{6} $$

where

$$ c_j = \sum_{k=1}^{n} x_k \cos\left(\frac{2\pi (k-1)}{n} j\right), \tag{7} $$

$$ s_j = \sum_{k=1}^{n} x_k \sin\left(\frac{2\pi (k-1)}{n} j\right). \tag{8} $$

Here, it is straightforward to prove that $c_j$ and $s_j$ converge to the
normal distribution with mean $\mu = 0$ and variance $\sigma^2 = n/2$ under the
assumption that the $x_k$ ($-1$ or $+1$ for $k = 1, 2, \ldots, n$) are random.
Therefore, $Y = (c_j/\sigma)^2 + (s_j/\sigma)^2$ converges to the following
distribution function ($\chi^2$ distribution with 2 degrees of freedom):

$$ P(Y) = \frac{1}{2} \exp\left(-\frac{Y}{2}\right). \tag{9} $$

If we transform $Y$ to $Z = Y/2$, we get the following distribution:

$$ P(Z) = \exp(-Z). \tag{10} $$

The threshold $T$ is defined such that the fraction of peaks exceeding $T$
should be 5% under the assumption of randomness. Since

$$ \int_{Z_c}^{\infty} \exp(-Z)\, dZ = \exp(-Z_c) = 0.05, \tag{11} $$

we get the value $Z_c = -\ln(0.05) = 2.995732274$. From $|S_j| = \sqrt{nZ}$, we
conclude that

$$ T = \sqrt{2.995732274\, n}. \tag{12} $$
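
The value of $Z_c$ and the effect of rounding it to 3 can be checked with a
short calculation. The sketch below, with illustrative variable names,
evaluates both thresholds for $n = 10^6$ and the expected number of peaks below
each threshold among the first $n/2$ peaks, using the exponential distribution
of $Z$ derived above.

```python
import math

n = 10**6
z_c = -math.log(0.05)                 # exact 95% point of exp(-Z)
t_exact = math.sqrt(z_c * n)
t_nist = math.sqrt(3.0 * n)

# Probability that a single peak lies below the threshold sqrt(z * n).
p_below_exact = 1.0 - math.exp(-z_c)  # 0.95 by construction
p_below_nist = 1.0 - math.exp(-3.0)

expected_exact = p_below_exact * n / 2
expected_nist = p_below_nist * n / 2
shift = expected_nist - expected_exact

print(f"Z_c = {z_c:.9f}")
print(f"T (exact) = {t_exact:.3f}, T (sqrt(3n)) = {t_nist:.3f}")
print(f"expected peaks below T: {expected_exact:.1f} vs {expected_nist:.1f}")
print(f"shift = {shift:.1f} peaks")
```

With $T = \sqrt{3n}$ the expected count exceeds $N_0 = 475000$ by about 106
peaks, which is not small compared with the standard deviation of about 154
implied by the NIST variance $n(0.95)(0.05)/2$.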

We have found that the deviation of $\sqrt{3n}$ from $\sqrt{2.995732274\,n}$
makes the distribution invalid. Figure 3 shows the distribution of $N_1$ in the
SHA1 case (300000 samples of $n = 10^6$ bit sequences). Note that the expected
value of $N_1$, that is, $N_0 = 0.95n/2$, is 475000 in this case. If we set the
threshold $T = \sqrt{3n}$, then the distribution is shifted to the right. So,
we have to set the threshold $T = \sqrt{2.995732274\,n}$. This is the first
mistake in the DFT test.

Fig. 3. The distribution of $N_1$ in SHA1 case (300000 samples of $n = 10^6$
bit sequence), for the correct threshold and the wrong threshold. Note that the
expected value of $N_1$, that is, $N_0$, is 475000.

3.2 The theoretical distribution

Because we use the real values $x_k$, the symmetry $|S_j| = |S_{n-j}|$ appears
in the peaks. So, NIST focuses on the first $n/2$ peaks. The test description
in the NIST document uses the theoretical distribution whose mean value $\mu$
is $np/2$ and whose variance $\sigma^2$ is $npq/2$, where $p = 0.95$,
$q = 0.05$, and $n = 10^6$ ($n/2$ coin tosses with probabilities $p$ and $q$).
However, these coin tosses are not an independent process: the quantity
$\sum_j |S_j|^2$ is conserved in this process. In this case, the variance
$\sigma^2$ becomes $npq/4$. Figure 4 shows the fit of the distribution of $N_1$
in the SHA1 case, with the threshold $T = \sqrt{2.995732274\,n}$, to the two
theoretical distributions. We can confirm that the distribution fits the new
theoretical distribution.

Fig. 4. The fitting of the distribution of $N_1$ in SHA1 case with the
threshold $T = \sqrt{2.995732274\,n}$ and two theoretical distributions.
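
The conservation law mentioned above is Parseval's relation: for a sequence of
-1 and +1 values of length $n$, the squared moduli of the $n$ DFT coefficients
always sum to $n^2$, whatever the bits are. The sketch below checks this for a
short random sequence with a naive DFT and then compares the two candidate
standard deviations for $n = 10^6$. The helper name `squared_moduli` is
hypothetical.

```python
import cmath
import math
import random


def squared_moduli(x):
    """|S_j|^2 for j = 0..n-1, computed with a naive O(n^2) DFT."""
    n = len(x)
    result = []
    for freq in range(n):
        s = sum(x_k * cmath.exp(1j * 2.0 * math.pi * k * freq / n)
                for k, x_k in enumerate(x))
        result.append(abs(s) ** 2)
    return result


rng = random.Random(2)
x = [2 * rng.randint(0, 1) - 1 for _ in range(128)]
total = sum(squared_moduli(x))
print(total, len(x) ** 2)  # equal up to rounding error

n, p, q = 10**6, 0.95, 0.05
sigma_nist = math.sqrt(n * p * q / 2)       # independent coin tosses
sigma_corrected = math.sqrt(n * p * q / 4)  # variance proposed in the text
print(f"sigma: {sigma_nist:.2f} (npq/2) vs {sigma_corrected:.2f} (npq/4)")
```

Because the peak heights are tied together by this constraint, they cannot be
treated as independent coin tosses, which is the reason given for the smaller
variance.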

4 Corrections of Lempel-Ziv Compression Test 

In this section, we focus on the Lempel-Ziv test and show two mistakes found in
the NIST test setting. The focus of this test is the number of cumulatively
distinct patterns (words) in the sequence. The purpose of the test is to
determine how far the tested sequence can be compressed. The sequence is
considered to be non-random if it can be significantly compressed. A random
sequence will have a characteristic number of distinct patterns. The test
description in the NIST document is as follows.

1. Parse the sequence into consecutive, disjoint, and distinct words that will
form a "dictionary" of words in the sequence. This is accomplished by creating
substrings from consecutive bits of the sequence until a substring is created
that has not been found previously in the sequence. The resulting substring is
a new word in the dictionary.

2. Compute P-value
$= \frac{1}{2} \mathrm{erfc}\left(\dfrac{\mu - W_{obs}}{\sqrt{2\sigma^2}}\right)$,
where $W_{obs}$ is the number of words obtained in step 1, and
$\mu = 69588.2019$ and $\sigma^2 = 73.23726011$ when $n = 10^6$ (these values
were updated on Oct. 26, 1999). Note that since no known theory is available to
determine the exact values of $\mu$ and $\sigma$, these values were computed
using SHA1.
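
The parsing in step 1 and the P-value in step 2 can be sketched as follows. The
function names are hypothetical, and one detail is an assumption: a trailing
substring that is still a known word when the input ends is not counted, which
may differ from the NIST implementation. The constants are the $n = 10^6$
values quoted above, so the P-value is only meaningful for sequences of that
length.

```python
import math

MU = 69588.2019        # mean number of words for n = 10**6 (NIST setting)
SIGMA2 = 73.23726011   # variance for n = 10**6 (NIST setting)


def lz_word_count(bits):
    """Number of distinct words found by incremental (LZ78-style) parsing."""
    dictionary = set()
    word = ""
    for b in bits:
        word += str(b)
        if word not in dictionary:
            dictionary.add(word)  # new word: store it and start the next one
            word = ""
    # ASSUMPTION: an unfinished trailing word is ignored.
    return len(dictionary)


def lz_p_value(w_obs, mu=MU, sigma2=SIGMA2):
    """P-value = 1/2 * erfc((mu - W_obs) / sqrt(2 * sigma^2))."""
    return 0.5 * math.erfc((mu - w_obs) / math.sqrt(2.0 * sigma2))


print(lz_word_count([0, 1, 0, 1, 1, 0, 0, 1, 0]))  # words: 0, 1, 01, 10, 010
print(lz_p_value(69588))
```

Since the number of words is an integer and the standard deviation is only
about 8.6, neighboring word counts give noticeably different P-values.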



4.1 The asymmetric distribution 

There are asymptotically well-approximated formulas for the mean and the
variance of the distribution of the Lempel-Ziv test [14,15]. However, a
simulation study using BBS showed that these formulas are invalid for sequences
of length less than $10^7$. Therefore, SHA1, which is one of the well-known
good PRBGs, is used instead to obtain the mean value and the variance value in
the NIST setting [8]. The accuracy of such empirical estimates depends on the
randomness of the generator used. Figure 5 shows the distributions of the
number of words in the SHA1 case and the CA case ($10^6$ samples of
$n = 10^6$ bit sequences). The two distributions are almost the same although
the two algorithms are completely different. We can confirm subtle asymmetries
if we look at Fig. 6 carefully. We conclude that this distribution can be used
for the mean and variance values of a new setting of the test. Through the
fitting of the distributions, we obtained the mean value $\mu = 69588.09$ and
the variance values $\sigma^2 = 75.574336518$ for the left branch and
$\sigma^2 = 72.42178447$ for the right branch. Consequently, we obtained new
empirical estimates (an asymmetric distribution) which are better than the NIST
setting.

Fig. 5. The distribution of the number of words in SHA1 case and CA case
(horizontal axis: the number of words, 69540 to 69640). $10^6$ samples of
$n = 10^6$ bit sequence are used.

Fig. 6. The distribution of the number of words (SHA1 case) in different scale.
The horizontal axis denotes the square of distance from the mean value for both
branches. The same data as in Fig. 5 is used.



4.2 The effect of discreteness 

Despite the best fitting of the distribution, the uniformity of P-values cannot
be improved. This is because the distribution of the number of words is too
narrow (the variance is too small), so the effect of discreteness appears. In
other words, the variety of P-values that can appear is limited. Figure 7 shows
the number of occurrences of each P-value in the SHA1 and CA cases. Because
only two or three distinct P-values appear in each of the central bins, we
never get uniformity of the P-values in this situation.

Fig. 7. The number of occurrences of each P-value in SHA1 and CA cases
(horizontal axis: P-value). $10^6$ samples of $n = 10^6$ bit sequence are used.
The numbers shown in the figure denote the variety of P-values appearing in
each bin.
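
The effect of discreteness can be reproduced without any generator, because the
P-value depends only on the integer word count. The sketch below enumerates the
P-values given by the NIST formula P-value = (1/2) erfc((mu - W) / sqrt(2
sigma^2)) with the NIST constants for $n = 10^6$ and counts how many distinct
P-values fall into each of the ten bins. The range of word counts scanned (60
on either side of the mean) is an arbitrary choice, wide enough to cover the
whole distribution.

```python
import math

MU = 69588.2019        # NIST mean for n = 10**6
SIGMA2 = 73.23726011   # NIST variance for n = 10**6


def lz_p_value(w_obs):
    """NIST Lempel-Ziv P-value for an integer word count (n = 10**6)."""
    return 0.5 * math.erfc((MU - w_obs) / math.sqrt(2.0 * SIGMA2))


distinct_per_bin = [0] * 10
for w in range(int(MU) - 60, int(MU) + 61):
    p = lz_p_value(w)
    distinct_per_bin[min(int(p * 10), 9)] += 1

for i, count in enumerate(distinct_per_bin):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {count} possible P-values")
```

In the four central bins (0.3 to 0.7) only two or three word counts map into
each bin, so the histogram of P-values cannot be flat however good the
generator is.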
Because the purpose of checking the uniformity of P-values is to detect the
deviation of the distribution from that of the random-sequence case, we
redefine the uniformity of P-values, for this test only, with respect to the
histogram of P-values itself, which is produced by SHA1 and CA5 ($10^6$
samples). In other words, we use the following formula for checking the
uniformity instead of eq. (2):

$$ \chi^2 = \sum_{i=1}^{10} \frac{(F_i - m S_i)^2}{m S_i}, $$

where $F_i$ is the number of P-values in bin $i$, $m$ denotes the sample size,
and $S_i$ denotes the rate of bin $i$, computed from the histogram of P-values
($10^6$ samples of SHA1 and CA data), that is, $S_1 = 0.1097085$,
$S_2 = 0.079127$, $S_3 = 0.107691$, $S_4 = 0.084465$, $S_5 = 0.1369235$,
$S_6 = 0.091115$, $S_7 = 0.0858035$, $S_8 = 0.1098615$, $S_9 = 0.1028565$,
and $S_{10} = 0.0924485$.
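
A minimal sketch of this modified uniformity check is given below. It assumes
that the statistic is the usual chi-square sum with the expected count of bin i
replaced by m times S_i, and that the P'-value is still obtained from
igamc(9/2, chi^2/2) as in the original check; both points are assumptions
inferred from the surrounding definitions. igamc is evaluated with a closed
form valid for the argument 9/2, and the function names are hypothetical.

```python
import math

# Bin rates S_1..S_10 quoted in the text (they sum to 1).
RATES = [0.1097085, 0.079127, 0.107691, 0.084465, 0.1369235,
         0.091115, 0.0858035, 0.1098615, 0.1028565, 0.0924485]


def igamc_9_2(x):
    """Regularized upper incomplete gamma Q(9/2, x) via the half-integer recurrence."""
    q = math.erfc(math.sqrt(x))
    for j in range(4):  # a = 1/2, 3/2, 5/2, 7/2
        a = 0.5 + j
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
    return q


def lz_uniformity_p_value(counts, rates=RATES):
    """P'-value of observed bin counts against the reference histogram."""
    m = sum(counts)
    chi2 = sum((f - m * s) ** 2 / (m * s) for f, s in zip(counts, rates))
    return igamc_9_2(chi2 / 2.0)


reference_like = [round(s * 10000) for s in RATES]  # follows the reference rates
flat = [1000] * 10                                  # perfectly flat counts
print(lz_uniformity_p_value(reference_like))
print(lz_uniformity_p_value(flat))
```

Under this check, counts that follow the reference histogram pass, while a
perfectly flat histogram, which the integer word counts cannot produce, is
rejected.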

5 Conclusion 

We corrected two points in the DFT test setting:

1. The correction of the threshold $T$ from $\sqrt{3n}$ to
$\sqrt{2.995732274\,n}$.

2. The correction of the variance $\sigma^2$ of the theoretical distribution
from $npq/2$ to $npq/4$.

We also corrected two points in the Lempel-Ziv test:

1. The setting of a standard distribution which has no algorithm dependence.
This asymmetric normal distribution has the mean value $\mu = 69588.09$ and the
variance values $\sigma^2 = 75.574336518$ for the left branch and
$\sigma^2 = 72.42178447$ for the right branch, in the $n = 10^6$ case.

2. The redefinition of the uniformity of P-values with respect to the histogram
of P-values itself, which is produced by SHA1 and CA5 ($10^6$ samples).

Figure 8 shows the behavior of the P'-values after the corrections as the
number of samples increases. As a result, the P'-values of the two tests are
improved (compare with Fig. 2).

Although the checking of the uniformity of P-values was not executed in the
evaluation of the AES candidate algorithms, the P-values used in these two
tests are themselves not meaningful. This suggests that the test results should
be re-evaluated.